CC_Module3

Author

Alfredo Aro Terleira

Word Clouds

A word cloud is an image composed of the words that occur in a particular text or subject. The size of each word indicates its frequency or importance.

To create a word cloud, we will need a text file.

Step 1: Download the file

  • It is important to write the path where we want to save the file correctly

  • We need to install two important libraries: tm and wordcloud

  • tm: stands for text mining; this package transforms the text into a format that R can handle

  • wordcloud: this package creates the visualization

library(tm)
Warning: package 'tm' was built under R version 4.4.2
Loading required package: NLP
Warning: package 'NLP' was built under R version 4.4.2
library(wordcloud)
Warning: package 'wordcloud' was built under R version 4.4.2
Loading required package: RColorBrewer
dir.create("C:/Users/USUARIO/Documents/GitHub/CC_Module3/wordcloud")
Warning in
dir.create("C:/Users/USUARIO/Documents/GitHub/CC_Module3/wordcloud"):
'C:\Users\USUARIO\Documents\GitHub\CC_Module3\wordcloud' already exists
download.file("https://ibm.box.com/shared/static/cmid70rpa7xe4ocitcga1bve7r0kqnia.txt",
              destfile = "C:/Users/USUARIO/Documents/GitHub/CC_Module3/wordcloud/churchill_speeches.txt", quiet = TRUE)
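The 'already exists' warning above appears every time the chunk is rerun. A minimal sketch of one way to avoid it, assuming a hypothetical folder under tempdir() rather than the path used here: build the path with file.path() and create the folder only when it is missing. The download call is left commented out so the sketch runs offline.

```r
# Hypothetical target folder; substitute your own project path.
target_dir <- file.path(tempdir(), "wordcloud")

# Create the folder only if it does not exist yet, so reruns stay silent.
if (!dir.exists(target_dir)) {
  dir.create(target_dir, recursive = TRUE)
}

# Build the destination path without hard-coding separators.
dest_file <- file.path(target_dir, "churchill_speeches.txt")

# Uncomment to download (same URL as above):
# download.file("https://ibm.box.com/shared/static/cmid70rpa7xe4ocitcga1bve7r0kqnia.txt",
#               destfile = dest_file, quiet = TRUE)
```

An alternative with the same effect is dir.create(..., showWarnings = FALSE), which keeps the call on one line but also hides warnings about genuine failures.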

Step 2: Directory

  • It is important to select the directory. We usually do this from Session -> Set Working Directory -> Choose Directory; here we set it in code
#Set the directory manually
dirPath <- "C:/Users/USUARIO/Documents/GitHub/CC_Module3/wordcloud"

#Load it as a corpus
speech <- Corpus(DirSource(dirPath))
  • Check the structure of our text corpus
inspect(speech)
<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 1

churchill_speeches.txt 
At present we lie within a few minutes’ striking distance of the French, Dutch and Belgian coasts, and within a few hours of the great aerodromes of Central Europe. We are even within canon-shot of the Continent.\n\nSo close as that! Is it prudent, is it possible, however much we might desire it, to turn our backs upon Europe and ignore whatever may happen there? I have come to the conclusion – reluctantly I admit – that we cannot get away. Here we are and we must make the best of it. But do not underrate the risks – the grevious risks – we have to run.\n\n\nThis is only the beginning of the reckoning. This is only the first sip, the first foretaste of a bitter cup which will be proffered to us year by year unless, by a supreme recovery of moral health and martial vigour, we arise again and take our stand for freedom as in the olden time.\n\n\nI would say to the House, as I said to those who have joined this Government: I have nothing to offer but blood, toil, tears and sweat. We have before us an ordeal of the most grievous kind. We have before us many, many long months of struggle and of suffering. You ask, what is our policy? I can say: It is to wage war, by sea, land and air, with all our might and with all the strength  that God can give us; to wage war against a monstrous tyranny, never surpassed in the dark, lamentable catalogue of human crime. This is our policy. You ask, what is our aim?\n\nI can answer in one word: It is victory, victory at all costs, victory in spite of all terror, victory, however long and hard the road may be, for without victory, there is no survival.\n\n\nEven though large tracts of Europe and many old and famous states have fallen or may fall into the grip of the Gestapo and all the odious apparatus of Nazi rule, we shall not flag or fail. 
We shall go on to the end, we shall fight in France, we shall fight on the seas and oceans, we shall fight with growing confidence and growing strength in the air, we shall defend our island, whatever the cost may be, we shall fight on the beaches, we shall fight on the landing grounds, we shall fight in the fields and in the streets, we shall fight in the hills; we shall never surrender.\n\n\nThe battle of France is over. I expect that the Battle of Britain is about to begin. Upon this battle depends the survival of Christian civilisation. Upon it depends our own British life, and the long continuity of our institutions and our Empire. The whole fury and might of the enemy must very soon be turned upon us. Hitler knows that he will have to break us in this island or lose the war.\n\nIf we can stand up to him, all Europe may be free and the life of the world may move forward into broad, sunlit uplands. But if we fail, then the whole world, including the United States, including all that we have known and cared for, will sink into the abyss of a new Dark Age made more sinister, and perhaps more protracted, by the lights of perverted science. Let us therefore brace ourselves to our duties, and so bear ourselves that, if the British Empire and its Commonwealth last for a thousand years, men will still say, ‘This was their finest hour.\n\n\nThe gratitude of every home in our Island, in our Empire, and indeed throughout the world, except in the abodes of the guilty, goes out to the British airmen who, undaunted by odds, unwearied in their constant challenge and mortal danger, are turning the tide of the World War by their prowess and by their devotion. Never in the field of human conflict was so much owed by so many to so few.\n\n\nFrom Stettin in the Baltic to Trieste in the Adriatic, an iron curtain has descended across the Continent. Behind that line lie all the capitals of the ancient states of Central and Eastern Europe. 
Warsaw, Berlin, Prague, Vienna, Budapest, Belgrade, Bucharest and Sofia, all these famous cities and the populations around them lie in what I must call the Soviet sphere.\n\n\nI am very glad that Mr Attlee described my speeches in the war as expressing the will not only of Parliament but of the whole nation. Their will was resolute and remorseless and, as it proved, unconquerable. It fell to me to express it, and if I found the right words you must remember that I have always earned my living by my pen and by my tongue. It was the nation and race dwelling all round the globe that had the lion heart. I had the luck to be called upon to give the roar. 

Step 3: Data Cleaning

  1. Convert the text to lower case
speech <- tm_map(speech, content_transformer(tolower))
  2. Remove all numbers with 'removeNumbers'
speech <- tm_map(speech, removeNumbers)
  3. Remove common stop words like ‘the’ and ‘we’
speech <- tm_map(speech, removeWords, stopwords("english"))
  4. You can also remove your own stop words by specifying them in a character vector
speech <- tm_map(speech, removeWords, c("floccinaucinihilipification", "squirrelled"))
  5. Remove punctuation with 'removePunctuation'
speech <- tm_map(speech, removePunctuation)
  6. Remove unnecessary whitespace with 'stripWhitespace'
speech <- tm_map(speech, stripWhitespace)
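To see what the cleaning steps do to a single sentence, here is a base-R sketch that approximates them with regular expressions. It is not the tm implementation: clean_text is a hypothetical helper, the stop-word list is a tiny made-up one rather than stopwords("english"), and the final trimws() call is an extra that the steps above do not perform.

```r
# Hypothetical helper that mimics the cleaning pipeline on a character vector.
clean_text <- function(x, stop_words) {
  x <- tolower(x)                              # lower case
  x <- gsub("[[:digit:]]+", "", x)             # drop numbers
  # drop whole-word matches of the stop words
  pattern <- sprintf("\\b(%s)\\b", paste(stop_words, collapse = "|"))
  x <- gsub(pattern, "", x, perl = TRUE)
  x <- gsub("[[:punct:]]+", "", x)             # drop punctuation
  x <- gsub("[[:space:]]+", " ", x)            # collapse runs of whitespace
  trimws(x)                                    # extra: trim both ends
}

cleaned <- clean_text("We shall fight on the beaches in 1940!",
                      stop_words = c("we", "on", "the", "in"))
print(cleaned)
```

The order matters: stop words are removed before punctuation, as in the steps above, so that contractions in a real stop-word list can still match.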

Step 4: Term Document Matrix

The next step is to create a term-document matrix, a table that contains the frequency of each word in each document. We will use 'TermDocumentMatrix'.

#Create a Term Document Matrix
dtm <- TermDocumentMatrix(speech)

#Matrix transformation
m <- as.matrix(dtm)

#Sort it to show the most frequent words
v <- sort(rowSums(m), decreasing=TRUE)

#transform to a data frame
d <- data.frame(word = names(v), freq=v)
head(d,10)
           word freq
shall     shall   11
fight     fight    7
may         may    6
will       will    6
europe   europe    5
upon       upon    5
victory victory    5
war         war    5
can         can    4
many       many    4
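With a single document, the matrix has one column, so the row sums are simply word counts. The same idea can be sketched in base R with table(); the sample text and the name d_sketch are made up for illustration, and tm additionally applies its own tokenizer and minimum word length, which this sketch does not reproduce.

```r
# Made-up, already-cleaned text standing in for the speech.
text <- "shall fight shall win fight shall"

# Split on single spaces to get one token per word.
tokens <- strsplit(text, " ", fixed = TRUE)[[1]]

# Count each distinct word and sort from most to least frequent.
counts <- sort(table(tokens), decreasing = TRUE)

# Same two-column layout as the data frame d above.
d_sketch <- data.frame(word = names(counts),
                       freq = as.vector(counts),
                       stringsAsFactors = FALSE)
print(d_sketch)
```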

Step 5: Simple Word Cloud

wordcloud(words = d$word, freq = d$freq)

Step 6: Frequency

You can also adjust the number of words by specifying the minimum frequency a word needs in order to be plotted.

wordcloud(words = d$word, freq = d$freq,
          min.freq=1)
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1): famous could
not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1): confidence
could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1): including
could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1): europe could
not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1): victory
could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1): never could
not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1): dwelling
could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1): beginning
could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1): striking
could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1): belgrade
could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1): can could
not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1): populations
could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1): foretaste
could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1): stettin
could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1): minutes
could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1): remorseless
could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1): blood could
not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1): civilisation
could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1): conclusion
could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1): institutions
could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1): prudent
could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1): underrate
could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1): depends
could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1): catalogue
could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1): knows could
not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1): france could
not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1): vigour could
not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1): even could
not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1): martial
could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1): make could
not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1): forward
could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1): reckoning
could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1): wage could
not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1): lights could
not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1): possible
could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1): close could
not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1): suffering
could not be fit on page. It will not be plotted.
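The 'could not be fit on page' warnings mean that wordcloud ran out of room for some words, so they are left out of the plot. One common remedy, sketched here with a made-up frequency table (d_demo is hypothetical), is to shrink the text with the scale argument; enlarging the plotting device is another. The values c(3, 0.4) are an assumption to tune, not a recommended setting.

```r
# Small made-up frequency table so the sketch does not depend on the speech file.
d_demo <- data.frame(
  word = c("shall", "fight", "victory", "europe", "war"),
  freq = c(11, 7, 5, 5, 5)
)

# Only draw the plot when the wordcloud package is installed.
if (requireNamespace("wordcloud", quietly = TRUE)) {
  set.seed(1234)  # fixes the random layout between runs
  # scale = c(largest, smallest) text size; the package default is c(4, 0.5).
  wordcloud::wordcloud(words = d_demo$word, freq = d_demo$freq,
                       min.freq = 1, scale = c(3, 0.4))
}
```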

Step 7: Max Words

You can impose a limit on the number of words that are displayed.

wordcloud(words = d$word, freq = d$freq,
          min.freq = 1, max.words=250)
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words =
250): upon could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words =
250): shall could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words =
250): europe could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words =
250): reckoning could not be fit on page. It will not be plotted.

Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words =
250): empire could not be fit on page. It will not be plotted.

Step 8: Colors

#install.packages("RColorBrewer")
library(RColorBrewer)
library(wordcloud)

wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words = 250, colors = brewer.pal(8, "Dark2"))
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words =
250, : policy could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words =
250, : martial could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words =
250, : never could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words =
250, : wage could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words =
250, : hours could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words =
250, : shall could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words =
250, : catalogue could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words =
250, : adriatic could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words =
250, : supreme could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words =
250, : sweat could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words =
250, : bitter could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words =
250, : strength could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words =
250, : challenge could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words =
250, : lamentable could not be fit on page. It will not be plotted.

Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words =
250, : freedom could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words =
250, : distance could not be fit on page. It will not be plotted.

Step 9: Centered Word Cloud

wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words=250, colors = brewer.pal(8, "Dark2"), random.order = FALSE)
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words =
250, : supreme could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words =
250, : surrender could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words =
250, : therefore could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words =
250, : thousand could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words =
250, : tongue could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words =
250, : unconquerable could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words =
250, : undaunted could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words =
250, : underrate could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words =
250, : united could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words =
250, : unless could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words =
250, : unwearied could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words =
250, : uplands could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words =
250, : vigour could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words =
250, : without could not be fit on page. It will not be plotted.
Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words =
250, : words could not be fit on page. It will not be plotted.

Radar Charts

  • Radar charts are a way to display multivariate data within one plot

Step 1: Installing the libraries

Important: the ggradar package can be somewhat tricky to install, but it is easier to install it from GitHub as follows:

#devtools::install_github("ricardo-bion/ggradar", dependencies = TRUE)
library(ggplot2)
Warning: package 'ggplot2' was built under R version 4.4.2

Attaching package: 'ggplot2'
The following object is masked from 'package:NLP':

    annotate
library(ggradar)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(scales)
Warning: package 'scales' was built under R version 4.4.2
library(tibble)

Step 2: Our dataset

mtcars
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Step 3: Radar Chart

#Select our dataset
mtcars %>%
  #move the row names (car names) into a column called "group"
  rownames_to_column(var = "group") %>%
  #rescale every numeric variable to the range 0 to 1
  mutate(across(where(is.numeric), rescale)) %>%
  #select which data to plot: the first 3 cars and the first 10 columns
  head(3) %>% select(1:10) -> mtcars_radar
#this code will generate lots of warnings, so let's suppress them
options(warn=-1)
ggradar(mtcars_radar)
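The rescale step maps every numeric column onto the range 0 to 1, which is what lets variables with very different units share the same radar axes. A base-R sketch of that min-max transformation, where rescale01 is a hypothetical helper standing in for scales::rescale with its default settings:

```r
# Hypothetical helper: min-max rescaling of a numeric vector to [0, 1].
rescale01 <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Apply it to every column of the built-in mtcars data set.
scaled <- as.data.frame(lapply(mtcars, rescale01))
rownames(scaled) <- rownames(mtcars)

# Each column now runs from 0 (its smallest value) to 1 (its largest).
print(sapply(scaled, range))
```

Because the scaling uses the minimum and maximum of all 32 cars, it has to happen before head(3) selects the rows to plot, as in the pipeline above.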

Step 4: Output

This step was meant to improve the visualization, but the chart looks the same, so it is optional.

#We need to install the following

#devtools::install_github("IRkernel/IRkernel")


#IRkernel::set_plot_options(width=950, height=600, units='px')
#ggradar(mtcars_radar)

Waffle Charts

Waffle charts are a great way to visualize data in relation to a whole or to highlight progress against a given threshold.

Step 1: Libraries

library(ggplot2)
library(waffle)

Step 2: Implementation in R

  1. First, we need to create a named vector with the household spending data
expenses <- c('Health ($43,212)' = 43212,
              'Education ($113,412)' = 113412,
              'Transportation ($20,231)' = 20231,
              'Entertainment ($28,145)' = 28145)
  2. To create our waffle chart, we will use the 'waffle' function
  • expenses/1235: the divisor acts as a scaling factor. It reduces the spending figures so that the waffle chart has a readable number of squares, neither too many nor too few. Each square therefore represents $1,235.
waffle(expenses/1235, rows=5, size=0.3,
       colors=c("#c7d4b6", "#a3aabd", "#a0d0de", "#97b5cf"),
       title="Imaginary Household Expenses Each Year",
       xlab = "1 square = $1,235")
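Because the chart divides every value by a constant, it helps to derive both the square counts and the axis label from that same constant so the two cannot drift apart. A minimal sketch, assuming shortened category names and hypothetical variable names (unit, squares, label), with explicit rounding so the counts are whole squares:

```r
# Same spending figures as above, with shortened names for readability.
expenses <- c(Health = 43212, Education = 113412,
              Transportation = 20231, Entertainment = 28145)

# Dollars represented by one square (the divisor used in the waffle call).
unit <- 1235

# Whole number of squares per category.
squares <- round(expenses / unit)

# Label built from the same constant, with a thousands separator.
label <- sprintf("1 square = $%s", format(unit, big.mark = ","))

print(squares)
print(label)
```

The resulting squares and label could then be passed to waffle() in place of expenses/1235 and the hand-written xlab.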

Step 3: Improvement

This step was meant to improve the visualization, but the chart looks the same, so it is optional.

#We need to install the following

#devtools::install_github("IRkernel/IRkernel")
#options(repr.plot.width = 9.5, repr.plot.height = 6)
#waffle(expenses/1235, rows=5, size=0.3,
       #colors=c("#c7d4b6", "#a3aabd", "#a0d0de", "#97b5cf"),
       #title="Imaginary Household Expenses Each Year",
       #xlab = "1 square = $1,235")

Box Plots

A box plot summarizes the distribution of sorted numerical data.

  • The first quartile is the point 25% of the way through the sorted data.

  • In other words, a quarter of the data points are less than this value.

  • Similarly, 75% of the points are less than the third quartile value.

  • The interquartile range is simply the difference between the first and third quartile

  • The median is effectively the second quartile

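  • The lower and upper whiskers extend from the box to the most extreme values that lie within 1.5 times the interquartile range of the quartiles (the ggplot2 default); points beyond the whiskers are drawn individually as outliers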
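The quantities in the list above can be computed directly in base R. The following sketch uses a small made-up vector and the common 1.5 times IQR rule for the whiskers; the variable names are illustrative only.

```r
# Made-up sample with one unusually large value.
x <- c(2, 4, 4, 5, 7, 9, 10, 12, 30)

# First quartile, median (second quartile) and third quartile.
q <- quantile(x, probs = c(0.25, 0.5, 0.75))

# Interquartile range: third quartile minus first quartile.
iqr <- unname(q[3] - q[1])

# Fences at 1.5 * IQR beyond the quartiles.
lower_fence <- unname(q[1]) - 1.5 * iqr
upper_fence <- unname(q[3]) + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences.
whiskers <- range(x[x >= lower_fence & x <= upper_fence])

# Anything outside the fences is plotted as an individual outlier.
outliers <- x[x < lower_fence | x > upper_fence]

print(q)
print(whiskers)
print(outliers)
```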

An example…

Step 0: Example Data Frame

To make the results reproducible, we fix the seed value for the random number generator. The data will appear random, but it will be the same every time the code is run.

#making the results reproducible
set.seed(1234)
  1. Create two sets of data
set_a <- rnorm(200, mean=1, sd=2)
set_b <- rnorm(200, mean=0, sd=1)
  • Set A is sampled from the normal distribution with mean 1, standard deviation 2

  • Set B has mean 0, standard deviation 1

  2. We will place these sets into a data frame, separated by label
#create the data frame
df <- data.frame(label = factor(rep(c("A","B"), each=200)), value = c(set_a,set_b))
head(df)
  label     value
1     A -1.414131
2     A  1.554858
3     A  3.168882
4     A -3.691395
5     A  1.858249
6     A  2.012112
tail(df)
    label       value
395     B  0.52874502
396     B  0.78939440
397     B  0.45709951
398     B  0.53883312
399     B  0.01464312
400     B -0.91648914
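A quick way to convince yourself that the seed makes the data reproducible is to draw the same sample twice and compare. This is a minimal sketch with hypothetical names (first_run, second_run) and a smaller sample size than above:

```r
# Draw five values after setting the seed.
set.seed(1234)
first_run <- rnorm(5, mean = 1, sd = 2)

# Reset the seed and draw again with the same parameters.
set.seed(1234)
second_run <- rnorm(5, mean = 1, sd = 2)

# TRUE when both draws are exactly the same numbers.
print(identical(first_run, second_run))
```

Without the second set.seed() call, the generator would continue from where it left off and second_run would differ.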

Step 0.1: Importing packages

library(ggplot2)
library(plotly)

Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':

    last_plot
The following object is masked from 'package:stats':

    filter
The following object is masked from 'package:graphics':

    layout

Step 0.2: geom_boxplot()

ggplot(df, aes(x=label, y=value)) + geom_boxplot()

ggplotly()

Now let's explore the mtcars data…

Step 1: Review the dataset

We are going to work with the first two variables in the summary below: miles per gallon (mpg) and number of cylinders (cyl).

summary(mtcars)
      mpg             cyl             disp             hp       
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
      drat             wt             qsec             vs        
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
       am              gear            carb      
 Min.   :0.0000   Min.   :3.000   Min.   :1.000  
 1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
 Median :0.0000   Median :4.000   Median :2.000  
 Mean   :0.4062   Mean   :3.688   Mean   :2.812  
 3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :1.0000   Max.   :5.000   Max.   :8.000  

Step 2: Creating box plots using qplot()

The cyl variable behaves more like a categorical variable than a numeric one, so it can be used to group the remaining values.

qplot(factor(cyl), mpg, data = mtcars, geom = "boxplot")

Step 3: Creating box plots using ggplot()

cars <- ggplot(mtcars, aes(factor(cyl), mpg))
cars + geom_boxplot()
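The centre line of each box is the median mpg for that cylinder group. As a cross-check on the plot, the medians can be computed in base R with tapply(); med is a hypothetical name used only in this sketch.

```r
# Median miles per gallon within each cylinder group (4, 6 and 8).
med <- tapply(mtcars$mpg, factor(mtcars$cyl), median)
print(med)

# Number of cars behind each box, useful when judging how stable it is.
print(table(mtcars$cyl))
```

The medians fall as the number of cylinders rises, which matches the downward shift of the boxes from left to right.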